In [9]:
# Note for presentation mode: run the first two cells before presenting.
# Render our plots inline
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier
# (pd.set_option('display.mpl_style', ...) was removed in pandas 0.19;
# use a matplotlib style instead)
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (18, 5)
In [10]:
%%HTML
<!-- Not content, but for better presentation styles -->
<style>
.rise-enabled .rendered_html table.dataframe {
font-size: 16px;
}
.rendered_html code {
background: #EFEFEF;
border: 1px solid #CCC;
padding: 1px 5px;
border-radius: 1px;
}
.rise-enabled .fa-times-circle:before,
.rise-enabled .fa-question:before,
.rise-enabled aside.controls {
opacity: 0.1;
}
.rise-enabled a,
.rise-enabled a:active,
.rise-enabled a:hover,
.rise-enabled a:visited,
.rise-enabled :link {
text-decoration: none;
color: #A68910;
}
</style>
This talk is a tutorial (thanks @evildmp!) on cleaning up messy data with pandas. You might also find other pandas tutorials useful.
Slides will be available online.
There will be time for questions at the end - happy to navigate specific problems you may have.
In short: we can predict and track cyanobacteria, and keep our water healthy.
In this talk, we use pandas for data wrangling.
You can install it with your favourite package manager (instructions).
$ conda install pandas
To begin:
pandas.read_csv(): Opens a CSV file as a DataFrame, like a table.
DataFrame.head(): Displays the first 5 entries.
In [11]:
import pandas as pd
# Open a comma-separated values (CSV) file as a DataFrame
weather_observations = pd.read_csv('observations/Canberra_observations.csv')
# Print the first 5 entries
weather_observations.head()
Out[11]:
The file seems to be tab-separated. There are dates, and some empty items.
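When a read looks wrong, it can help to inspect the raw text before re-reading. A minimal sketch of that diagnosis, using a small synthetic file (the original observations file isn't bundled with this notebook):

```python
import pandas as pd

# Write a tiny tab-separated sample to illustrate the problem
with open('sample.tsv', 'w') as f:
    f.write('Date\tTime\tTmp\n01/01/2013\t00:00\t21.8\n')

# Peek at the first raw line: the \t characters reveal the separator
with open('sample.tsv') as f:
    print(repr(f.readline()))  # 'Date\tTime\tTmp\n'

# The default comma separator crams everything into one column ...
print(pd.read_csv('sample.tsv').shape[1])  # 1
# ... while sep='\t' recovers all three columns
print(pd.read_csv('sample.tsv', sep='\t').shape[1])  # 3
```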
pandas.read_csv() is very versatile with keyword arguments.
sep: The separator between columns.
parse_dates: Treat one or more columns like dates.
dayfirst: Use DD.MM.YYYY format, not month first.
infer_datetime_format: Tell pandas to guess the date format.
na_values: Specify values to be treated as empty.
In [12]:
# Supply pandas with some hints about the file to read
weather_observations = pd.read_csv('observations/Canberra_observations.csv',
# sep='\t',
# parse_dates={'Datetime': ['Date', 'Time']},
# dayfirst=True,
# infer_datetime_format=True,
# na_values=['-'],
)
# Display some entries
weather_observations.head()
Out[12]:
Note: NaN in the table above marks an empty (missing) value; pandas uses the floating-point NaN as its missing-data marker.
In [13]:
# For consistency between slides
weather_observations = pd.read_csv('observations/Canberra_observations.csv',
sep='\t',
parse_dates={'Datetime': ['Date', 'Time']},
dayfirst=True,
infer_datetime_format=True,
na_values=['-']
)
The data seems complete, but still has some issues: duplicated entries, rows out of chronological order, and a plain integer index instead of the datetime.
pandas offers some functions to help us out:
DataFrame.drop_duplicates(): Delete duplicated items.
DataFrame.sort_values(): Rearrange in order.
DataFrame.set_index(): Specify a column to use as index.
In [14]:
# Remove duplicated items with the same date and time
no_duplicates = weather_observations.drop_duplicates('Datetime', keep='last')
# Sorting is ascending by default, or chronological order
# sorted_dataframe = no_duplicates.sort_values('Datetime')
# Use `Datetime` as our DataFrame index
# indexed_weather_observations = sorted_dataframe.set_index('Datetime')
# Display some entries
no_duplicates.head()
Out[14]:
In [15]:
# For consistency
no_duplicates = weather_observations.drop_duplicates('Datetime', keep='last')
sorted_dataframe = no_duplicates.sort_values('Datetime')
indexed_weather_observations = sorted_dataframe.set_index('Datetime')
To process wind directions, we need them as numbers.
We need to transform each direction label ("N", "NNE", "NE", "ENE", "E", etc.) into degrees in that column.
Series.apply(): Transforms each entry with a function.
In [16]:
# Translate wind direction to degrees
wind_directions = {
'N': 0. , 'NNE': 22.5, 'NE': 45. , 'ENE': 67.5 ,
'E': 90. , 'ESE': 112.5, 'SE': 135. , 'SSE': 157.5 ,
'S': 180. , 'SSW': 202.5, 'SW': 225. , 'WSW': 247.5 ,
'W': 270. , 'WNW': 292.5, 'NW': 315. , 'NNW': 337.5 }
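Since the 16 compass points are evenly spaced 22.5° apart, the table above could equally be generated; this sketch (not from the original talk) produces the identical mapping:

```python
# The 16 compass points in clockwise order, each 22.5 degrees apart
compass = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
           'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']
wind_directions = {name: i * 22.5 for i, name in enumerate(compass)}

print(wind_directions['ENE'])  # 67.5
print(wind_directions['S'])    # 180.0
```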
In [17]:
# Replace wind directions column with a new number column
# `get()` accesses values safely from dictionary
indexed_weather_observations['Wind dir'] = \
indexed_weather_observations['Wind dir'].apply(wind_directions.get)
# Display some entries
indexed_weather_observations.head()
Out[17]:
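Using `wind_directions.get` rather than `wind_directions[...]` matters when a label isn't in the dictionary: `get` returns None, which pandas stores as NaN, instead of raising a KeyError. A small sketch on a synthetic series (the 'calm' label is made up for illustration):

```python
import pandas as pd

directions = pd.Series(['N', 'SSE', 'calm'])  # 'calm' is not a compass point
degrees = directions.apply({'N': 0.0, 'SSE': 157.5}.get)

print(degrees.tolist())  # [0.0, 157.5, nan]
```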
Our data has odd timestamps from time to time. For example:
In [18]:
# One section where the data has weird timestamps ...
indexed_weather_observations[1800:1806]
Out[18]:
We can solve this. DataFrame.asfreq() forces a specific frequency on the index, discarding and filling the rest.
Let's set each data point to be every 30 minutes.
In [19]:
# Force the index to be every 30 minutes
regular_observations = indexed_weather_observations.asfreq('30min')
# Same section, now at different indices after setting the frequency :)
regular_observations[1633:1638]
Out[19]:
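The discard-and-fill behaviour of asfreq() can be seen on a tiny synthetic series (timestamps and values here are made up):

```python
import pandas as pd

# One off-grid timestamp (09:17) and a missing 09:30 slot
index = pd.to_datetime(['2013-01-01 09:00', '2013-01-01 09:17',
                        '2013-01-01 10:00'])
series = pd.Series([1.0, 2.0, 3.0], index=index)

# asfreq keeps only on-grid timestamps and inserts NaN for missing slots
regular = series.asfreq('30min')
print(regular)  # 09:00 -> 1.0, 09:30 -> NaN, 10:00 -> 3.0; 09:17 is dropped
```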
In [20]:
# Plot the first 500 entries with selected columns
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()
Out[20]:
There seem to be gaps in our dataset between 4th Jan and 9th Jan.
Weather sensors sometimes drop out.
But for modelling purposes, we need a gap-free dataset.
Series.interpolate(): Fill in empty values based on index.
In [21]:
# Interpolate data to fill empty values
for column in regular_observations.columns:
regular_observations[column].interpolate('time', inplace=True)
# Display some interpolated entries
regular_observations[1633:1638]
Out[21]:
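What the 'time' method does differently from plain linear interpolation is weight by the actual gap between timestamps. A sketch on a synthetic series:

```python
import pandas as pd

index = pd.to_datetime(['2013-01-01 09:00', '2013-01-01 09:30',
                        '2013-01-01 10:00'])
series = pd.Series([10.0, None, 30.0], index=index)

# 'time' interpolation fills the hole in proportion to elapsed time
filled = series.interpolate('time')
print(filled.iloc[1])  # 20.0: 09:30 is halfway between 09:00 and 10:00
```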
In [22]:
# Plot it again - gap free!
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()
Out[22]:
Now the dataset is ready to be used for modelling! 🎉
Today you learned how to use pandas in many ways:
pandas is much more powerful than what we covered today. Check out the documentation! You might find some gems.
Slides available at github.com/Spaxe/pyconau2017-messy-sensor-data
There is a bonus section on combining two DataFrames with different frequencies in the notebook.
🐹 Let's be friends on Twitter: @Xavier_Ho
In [23]:
# BONUS SECTION
# Similarly, for sky observations
sky_observations = pd.read_csv('observations/Canberra_sky.csv',
sep='\t',
parse_dates={'Datetime': ['Date', 'Time']},
dayfirst=True,
infer_datetime_format=True,
na_values=['-', 'obscured'])
sky_observations.head()
Out[23]:
In [24]:
# As before, remove duplicates and set index to datetime.
sky_observations.drop_duplicates('Datetime', keep='last', inplace=True)
sky_observations.sort_values('Datetime', inplace=True)
sky_observations.set_index('Datetime', inplace=True)
sky_observations.head()
Out[24]:
In [25]:
# Drop rows that have no data
sky_observations.dropna(how='all', inplace=True)
sky_observations.head()
Out[25]:
In [26]:
# Display the inferred data types
sky_observations.dtypes
Out[26]:
In [27]:
# What are the values in the 'Cloud' column?
sky_observations['Cloud'].unique()
Out[27]:
In [28]:
# 'obscured' means that the visibility was too low to see clouds. We will consider it to be NaN.
# Define a function to change the 'Cloud' column to numerical values
def cloud_to_numeric(s):
if s == 'clear' or pd.isnull(s):
return 0
else:
return int(s[0]) / 8.0
# Apply the function to every item and assign it back to the original dataframe
sky_observations['Cloud'] = \
sky_observations['Cloud'].apply(cloud_to_numeric, convert_dtype=False).astype('float64')
sky_observations.head()
Out[28]:
In [29]:
# Plot the cloud cover with scatter plot using matplotlib
clouds = sky_observations[['Cloud']][:100]
plt.plot_date(clouds.index, clouds.values)
Out[29]:
In [30]:
# Join the two observations together
combined_observations = regular_observations.combine_first(sky_observations[['Cloud']])
combined_observations.head()
Out[30]:
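combine_first() aligns the two frames on their indices, keeps the caller's values wherever they exist, and fills its holes (and missing columns) from the argument. A small sketch with made-up values:

```python
import pandas as pd

a = pd.DataFrame({'Tmp': [21.0, None]}, index=[0, 1])
b = pd.DataFrame({'Tmp': [99.0, 15.0], 'Cloud': [0.5, 1.0]}, index=[0, 1])

# The caller's values win; holes and absent columns come from the argument
combined = a.combine_first(b)
print(combined['Tmp'].tolist())    # [21.0, 15.0] - only the hole took b's value
print(combined['Cloud'].tolist())  # [0.5, 1.0]   - whole column came from b
```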
In [ ]:
# Create a new series with 30-minutely timestamps
time_series = pd.date_range('2013-01-01', '2017-01-01', freq='30min')[:-1]
time_series
In [ ]:
# Reindex our dataset
indexed_observations = combined_observations.reindex(time_series)
indexed_observations
In [ ]:
# Display the columns in the dataset
indexed_observations.columns
In [ ]:
# Interpolate data to fill NaN
for column in indexed_observations.columns:
indexed_observations[column].interpolate('time', inplace=True, limit_direction='both')
# Preview the cleaned data
indexed_observations
In [ ]:
# Current bug in pandas fails plotting some interpolated frequencies
# see https://github.com/pandas-dev/pandas/issues/14763 (to be fixed by 31 August, 2017)
# Convert pandas DateTimeIndex to Python's datetime.datetime
timestamps = indexed_observations.index[:1000].to_pydatetime()
# Selecting a few columns to plot
selection1 = indexed_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:1000]
selection2 = indexed_observations[['Cloud']][:1000]
# For now, we copy the index and values to matplotlib
# see https://stackoverflow.com/questions/43206554/typeerror-float-argument-must-be-a-string-or-a-number-not-period/45191625#45191625
legend = plt.plot_date(timestamps, selection1.values, '-')
plt.legend(selection1.columns)
plt.show()
legend = plt.plot_date(timestamps, selection2.values, '-')
plt.legend(selection2.columns)
plt.show()